Abstract: A distributed data collection algorithm to accurately
store and forward information obtained by wireless sensor 
networks is proposed. The proposed algorithm does not depend 
on the sensor network topology, routing tables, or geographic 
locations of sensor nodes, but rather makes use of uniformly 
distributed storage nodes. Analytical and simulation results 
for this algorithm show that, with high probability, the data 
disseminated by the sensor nodes can be precisely collected by 
querying any small set of storage nodes. 

I. Introduction 

Wireless sensor networks (WSNs) often consist of small 
devices (nodes) with limited processing ability, bandwidth and 
power. They can be deployed in isolated or dangerous areas to 
monitor objects, temperatures, etc. or to detect fires, floods, or 
other incidents. There has been extensive research on sensor 
networks to improve their utility and efficiency [15]. 

In this paper we consider a wireless sensor network $\mathcal{N}$ with
$n$ nodes, among which $k = n(1 - \alpha)$ are sensing nodes and
$n - k$ are storage nodes, for a small fraction $\alpha$ and
$k/n \approx 80\%$. The sensor and storage nodes are distributed randomly
in some region $\mathcal{R}$ and cannot maintain routing tables or shared
knowledge of network topology. Some nodes might disappear 
from the network due to failure or battery depletion. It is of 
interest to design storage strategies to collect sensed data from 
such sensors before they disappear suddenly from the network. 
Previous work on this problem has focused on situations in 
which either the network topology is known or the sensor 
nodes are able to maintain routing tables [8], [9], [12]. 

The authors in [1], [2] studied distributed storage algorithms
for wireless sensor networks with different topologies, in which
$k$ sensor nodes (sources) want to disseminate their data to
$n$ storage nodes with low computational complexity, where
$k/n \approx 20\%$. They used fountain codes and random walks
on graphs to solve this problem. They also assumed that the 
total numbers of sources and storage nodes are not known. In 
other words, they demonstrated an algorithm in which every 
node in a network can estimate the number of sources and 
storage nodes. In this work we solve the storage problem in 
WSNs by developing data collection algorithms with persistent 
storage nodes and dividing the region $\mathcal{R}$ into smaller regions.
We do not make assumptions about the routing or topology of the
network, as was done in [4], [12]. We consider situations in
which the sensor nodes are distributed uniformly in $\mathcal{R}$, and,
again, they do not maintain any routing tables or network
topology.

There have been several clustering algorithms to
aggregate nodes in wireless sensor networks. The most widely 
known are clustering by location or clustering using counters; 
see [3], [13]-[16] and references therein. The proposed data
collection algorithm is suitable for use in terrains where we
cannot choose positions of the sensor nodes or the cluster heads.
In this case, the system is self-stabilizing because if one node 
fails, no computations are needed to establish the cluster head. 

The rest of the paper is organized as follows. In Section II 
we present the network model and assumptions. The dis- 
tributed data collection algorithm is proposed in Section III 
and an analysis for this algorithm is presented in Section IV. In 
Section V, we demonstrate performance and simulation results 
for the proposed algorithm. In Section VI, we describe other 
work related to the proposed problem. Finally, the paper is 
concluded in Section VII. 

II. Network Model and Assumptions 

Assume a large scale wireless sensor network with a set of 
sensing nodes and a set of storage nodes. Both are distributed 
randomly and uniformly in a given region $\mathcal{R} = L \times L$, where
L is the side length. The sensing nodes have limited memory 
and bandwidth, and they might disappear from the network at 
any time due to limited battery lifetime. The storage nodes 
have large memory and bandwidth, but they do not sense 
information about the region. 

We assume that the data collector (base station) is far away 
from the nodes, but it is connected with a set of storage nodes.
The sensor nodes are able to sense data and distribute it to the 
storage nodes. 

A. Assumptions 

We consider the following assumptions about the sensor 
network model $\mathcal{N}$:

i) Let $S = \{s_1, \ldots, s_k\}$ be a set of sensing nodes that
are distributed randomly and uniformly in a given region
$\mathcal{R}$. All sensor nodes are homogeneous: they have the same
capabilities, such as mobility, limited memory, and limited power.

ii) Let $R = \{r_1, \ldots, r_{n-k}\}$ be the set of storage nodes
such that $(n - k)/n$ is between 10% and 20%. This assumption
differentiates our work from the work and problem considered
in [1], [2], [12]. All storage nodes have the same amount
of memory, power and bandwidth.

iii) The nodes do not maintain routing or geographic tables,
and the topology of the wireless sensor network is not
known. Each storage node $r_i$ can send multicast messages
to neighboring nodes. Also, each node $r_i$ can detect
its total number of neighbors by sending a simple flooding
query message; any sensor node that responds to this
message is a neighbor of this node. Therefore, our
work is more general than, and different from, the work
in [4], [6], which depends on knowledge of the network
topology and routing tables. The degree $d_n(u)$ of a node $u$
is the total number of neighbors with a direct connection
to this node.

iv) Each storage node has a memory buffer of size $M$, and
this buffer can be divided into smaller buffers, each of
size $c$, such that $\epsilon = \lfloor M/c \rfloor$. For simplicity we assume
that all storage nodes have equal memory size $M$.

v) Every node $s_i$ prepares a packet $packet_{s_i}$ with its $ID_{s_i}$,
sensed data $x_{s_i}$, and a flag that is set to zero or one:

$packet_{s_i}(ID_{s_i}, x_{s_i}, flag)$ (1)

vi) We will consider two different types of packets depending 
on the flag value: initialization and update packets. If the 
source node sends a packet and the flag is set to zero, then 
it will be considered as an initialization packet. Otherwise, 
it will be considered as an update packet. 

vii) The network is divided into clusters (sub-regions). Every 
cluster region is identified by a storage node $r_i$, which
exists in this cluster. Hence the storage node is also 
called the cluster head. Every storage node will accept the 
incoming packets with probability one, and will update its 
buffer if the flag is set to one. 
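
As an illustration of assumptions iv) to vi), the packet format and the buffer count can be sketched in Python as follows. This is a minimal sketch and not part of the proposed algorithm: the names Packet, StorageNode, memory_size and buffer_size are hypothetical, and the sensed data is modeled as a plain integer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    """Packet of assumption v): sensor ID, sensed data, and a flag."""
    sensor_id: int
    data: int       # sensed value, modeled here as an integer (assumption)
    flag: int = 0   # 0 = initialization packet, 1 = update packet

    def is_update(self) -> bool:
        # Assumption vi): a nonzero flag marks an update packet.
        return self.flag == 1


@dataclass
class StorageNode:
    """Storage node (cluster head) whose memory M is split into buffers of size c."""
    node_id: int
    memory_size: int   # M
    buffer_size: int   # c
    buffers: dict = field(default_factory=dict)  # sensor_id -> stored data

    @property
    def num_buffers(self) -> int:
        # Number of buffers = floor(M / c), as in assumption iv).
        return self.memory_size // self.buffer_size
```

For example, a storage node with M = 100 and c = 8 offers 12 buffers under this model.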

B. Distance Measurement and Clusters Distribution 

Since the sensing and storage nodes are distributed ran- 
domly, the distances between nodes are not known, but can 
be measured using the coverage radius of the nodes. When 
a storage node sends a flooding beacon message to all other 
sensing nodes, those sensing nodes that can receive this beacon 
will respond with reply messages. The storage node will accept 
these reply messages and decide to receive information from 
a node $s_j$ based on the following comparison:

$d_{r_i s_j} \le \delta$, (2)

where $d_{r_i s_j}$ denotes the (Euclidean) distance between $r_i$ and
$s_j$, and $\delta$ is a fixed distance for all storage nodes. In this case
if the distance $d_{r_i s_j}$ is greater than $\delta$, then the sensing node
$s_j$ does not lie in the cluster identified by $r_i$ [13].
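
The membership test in (2) can be sketched for simulation purposes as follows. This is an illustrative sketch under stated assumptions: in the network itself the nodes do not know their coordinates and membership follows from radio reachability, whereas a simulator has the coordinates and can evaluate the distance directly. The function names deploy, in_cluster and build_clusters are hypothetical.

```python
import math
import random


def deploy(count, side, rng):
    """Place `count` nodes uniformly at random in a side x side square."""
    return [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(count)]


def in_cluster(storage_pos, sensor_pos, delta):
    """Comparison (2): the sensor belongs to the cluster if its distance is at most delta."""
    return math.dist(storage_pos, sensor_pos) <= delta


def build_clusters(storage, sensors, delta):
    """Map each storage node index to the indices of the sensors inside its radius."""
    return {
        i: [j for j, s in enumerate(sensors) if in_cluster(r, s, delta)]
        for i, r in enumerate(storage)
    }
```

A sensor may fall inside several clusters, or inside none when no storage node is within distance delta of it.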

III. Distributed Data Collection Algorithms 

In this section, we propose a distributed data collection 
algorithm for the storage problem proposed in the previous 
section. The clustering storage algorithm runs in the following 
phases: 

i) Clustering phase: We assume that the sensor network
has $k/n \approx 80\%$ sensing nodes and $(n - k)/n \approx 20\%$
storage nodes. All clusters in the network are established
using clustering algorithms [13], [15]. In the clustering
phase, each storage node sends a flooding beacon message
with its ID to all neighboring nodes in the network.
Due to the random locations of the sensing nodes, some
nodes will be able to receive this message and reply with
their IDs to the storage nodes. In addition, the sensing
nodes will store the IDs of the storage nodes from which
they received beacon messages:

$packet_{r_i \to s}(ID_{r_i})$ (3)

Input: A sensor network with $S = \{s_1, \ldots, s_k\}$ source
nodes, $k$ source packets $x_{s_1}, \ldots, x_{s_k}$, and $n - k$
storage nodes $R = \{r_1, r_2, \ldots, r_{n-k}\}$.

Output: Storage buffers $y_1, y_2, \ldots, y_{n-k}$ for all storage
nodes $R$.

foreach storage node $r_i$, $i = 1 : n - k$ do
    Generate a beacon packet with its $ID_{r_i}$ and send a
    flooding message to all sensing neighbors;
    Every sensing node will decide the storage nodes to
    connect to;
end

foreach source node $s_i$, $i = 1 : k$ do
    Generate header of $x_{s_i}$ and set flag = 0;
    Prepare the packet $packet_{s_i}$;
    Send the packet $packet_{s_i}$ to storage nodes;
end

while source packets remaining do
    foreach node $r_j$ that receives packets do
        if flag = 0 then
            Put $x_{s_i}$ into $r_j$'s buffer;
        else
            Update the $y_j$ buffer of the storage node $r_j$:
            $y_j = y_j \oplus x_{s_i}$;
        end
    end
end

Algorithm 1: DSA-I Algorithm: Distributed data collection
algorithm for a WSN in which the data is disseminated using
multicasting messages to all storage nodes.


ii) Sensing phase: In the sensing phase, the sensor nodes 
sense data from the environment. Once the data is 
collected, they send their packets to the storage nodes
from which they have received beacon packets:

$packet_{s_i \to R_{s_i}}(ID_{s_i}, x_{s_i}, R_{s_i}, flag)$, (4)

where $R_{s_i}$ is the set of storage nodes with which $s_i$ is
connected. The flag value determines whether the packet 
contains an update or initially sensed data. The update 
data from the sensing nodes will occur whenever they 
sense new information about the surrounding environ- 
ment. 

iii) Data collection and storage phase: When a sensing 
node senses the environment, it sends its packets to its 
storage nodes. The storage nodes collect the incoming
packets and store them encoded in their own buffers.
Based on the type of the incoming packets, the storage
nodes will store these packets or update the existing data
in their buffers.

iv) Querying phase: The query process can be done by
the base station or server that collects all data from the
storage nodes. In the following sections we will study
the total number of nodes that must be queried in order
to obtain the data sensed by the sensor nodes.
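
The storage and querying phases can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: each storage node is modeled as a dictionary holding at most `capacity` entries (one per sensor), the sensed data is an integer, and the update step applies the XOR rule of Algorithm 1 literally. The names receive, run_dsa1 and query are hypothetical.

```python
def receive(buffers, capacity, sensor_id, data, flag):
    """Process one packet at a storage node; return True if it was stored or applied."""
    if flag == 0:
        # Initialization packet: store it if a buffer is available.
        if sensor_id in buffers or len(buffers) < capacity:
            buffers[sensor_id] = data
            return True
        return False  # all buffers of this node are in use
    # Update packet: y = y XOR x, applied to the buffer of this sensor.
    if sensor_id in buffers:
        buffers[sensor_id] ^= data
        return True
    return False


def run_dsa1(links, packets, capacity):
    """Deliver packets to the storage nodes each sensor heard during clustering.

    links:   sensor_id -> list of storage node ids (result of the clustering phase)
    packets: iterable of (sensor_id, data, flag) in arrival order
    """
    storage = {}
    for sensor_id, data, flag in packets:
        for r in links.get(sensor_id, []):
            receive(storage.setdefault(r, {}), capacity, sensor_id, data, flag)
    return storage


def query(storage, queried_ids):
    """Return the set of sensor ids whose data is held by the queried storage nodes."""
    revealed = set()
    for r in queried_ids:
        revealed.update(storage.get(r, {}))
    return revealed
```

A sensor that heard no beacon has an empty entry in `links`, so its data never reaches a storage node and cannot be revealed by any query.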

IV. DSA-I Analysis 

In this section we will analyze the proposed data collection 
algorithm, which we call DSA-I. 

Lemma 1: With high probability, the data collector can
retrieve information about the sensing nodes if

$\epsilon > k/(n - k)$, (5)

where $\epsilon$ is the number of buffers in each storage node.
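
As a worked instance of condition (5), consider the hypothetical values $n = 1000$ and $k = 800$. These numbers are chosen for illustration only and are not taken from the simulations.

```latex
\begin{align*}
n - k &= 1000 - 800 = 200, \\
\frac{k}{n - k} &= \frac{800}{200} = 4, \\
\epsilon > 4 \;&\Longrightarrow\; \epsilon \ge 5 \quad (\epsilon \text{ is an integer}).
\end{align*}
```

Equivalently, $\epsilon (n - k) > k$: the total number of buffers across all storage nodes exceeds the number of sensing nodes.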

Lemma 2: The probability that a sensor $s_j$ lands in the
range of a storage node is given by

$\frac{\pi\delta^2 - a}{L^2 k}$. (6)



Proof: We know that the sensor and storage nodes are
distributed independently and uniformly in the region $\mathcal{R}$. So
the probabilities that a randomly chosen sensor is $s_j$ and a
randomly chosen storage node is $r_i$ are given by

$\Pr(s_j) = \frac{1}{k}$ and $\Pr(r_i) = \frac{1}{n - k}$. (7)



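Let us define the random variable $X_{r_i s_j}$ to indicate the event
that one of the sensors $s_j$ lies within a radio range $\delta$ of a
storage node $r_i$, for $1 \le j \le k$ and $1 \le i \le n - k$:

$X_{r_i s_j} = \begin{cases} 1, & \text{if } d_{r_i s_j} \le \delta; \\ 0, & \text{if } d_{r_i s_j} > \delta, \end{cases}$ (8)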



where $d_{r_i s_j}$ is defined in (2). We also define the random
variable $Y_{r_i s}$ to indicate the event that any of the sensor
nodes $s_j$ lies within the range of a given storage node $r_i$, so:

$\Pr(Y_{r_i s} = 1) = \frac{\pi\delta^2 - a}{L^2}$, (9)



which is the area covered by the radio range within the
region $\mathcal{R}$ divided by the total area of $\mathcal{R}$, and $a$ is the area
of the portion of the radio range of the storage node that
falls outside $\mathcal{R}$. The previous terms are obtained assuming a
uniform probability distribution; therefore, the probability that
a particular sensor $s_j$ lies within the radio range of a storage
node $r_i$ is obtained by multiplying $\Pr(s_j)$ in (7) by (9), so

$\Pr(X_{r_i s_j} = 1) = \frac{\pi\delta^2 - a}{L^2 k}$. (10)

■
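
The area ratio in (9) can be checked numerically with a short Monte Carlo sketch. The code below assumes a storage node placed at the center of the region, so that its radio range lies entirely inside the region and $a = 0$; this placement, the function names, and the parameter values in the usage note are illustrative assumptions and are not taken from the simulations of Section V.

```python
import math
import random


def estimate_coverage(delta, side, trials, rng):
    """Estimate the probability that a uniform sensor lies within delta of a
    storage node at the center of a side x side region (so a = 0)."""
    center = (side / 2, side / 2)
    hits = 0
    for _ in range(trials):
        sensor = (rng.uniform(0, side), rng.uniform(0, side))
        if math.dist(center, sensor) <= delta:
            hits += 1
    return hits / trials


def analytic_coverage(delta, side, outside_area=0.0):
    """Right-hand side of (9): (pi * delta^2 - a) / L^2."""
    return (math.pi * delta ** 2 - outside_area) / side ** 2
```

For instance, with delta = 10 and side = 100 the analytic value is about 0.0314, and the estimate approaches it as the number of trials grows. For a storage node near the boundary, the overlap area `outside_area` would have to be supplied separately.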

The following lemma follows from Lemma 2, and its proof 
is a direct consequence. 



Lemma 3: The probability that a sensor $s_j$ lands in the
range of all storage nodes $R$ is given by

$\left(\frac{\pi\delta^2 - a}{L^2 k}\right)^{n-k}$. (11)

Proof: By Lemma 2, we know that the probability of one
sensor being located in the range of one storage node is given by

$P(X_{r_i s_j} = 1) = \frac{\pi\delta^2 - a}{L^2 k}$.

Since all storage nodes are distributed randomly and uniformly
in the region, we have

$P(X_{R s_j} = 1) = \prod_{i=1}^{n-k} P(X_{r_i s_j} = 1) = \left(\frac{\pi\delta^2 - a}{L^2 k}\right)^{n-k}$.

We also turn our attention to the probability that all
sensor nodes are sited at one particular storage node $r_i$. In this
case, the coefficients $p_{i1}, p_{i2}, p_{i3}, \ldots, p_{ik}$ of the $i$th row in the
storage code $C$ are all nonzero. In other words, $p_{ij} \neq 0$
for all $j = 1, 2, \ldots, k$.

Lemma 4: The probability that all sensors $S$ are sited in the
range of a storage node is given by

$\left(\frac{\pi\delta^2 - a}{L^2 k}\right)^{k}$. (12)



V. Performance and Simulation Results 



In this section, we study the performance of the proposed al- 
gorithm for WSNs through simulation. The main performance 
metric we investigate is the successful decoding probability 
versus the query ratio. We assume a square region $\mathcal{R}$ of size
$L \times L$ in the plane, in which $L = 100$. Recall that a sensor
node lies in the coverage radius of a storage node if $d_{r_i s_j} \le \delta$,
in which $\delta$ is the covering radius of the storage nodes.

Definition 5: (Storage Nodes Query Ratio) Let $h$ be the
number of storage nodes that are queried among the $n' = n - k$
storage nodes in $\mathcal{R}$. Let $\eta$ be the ratio between the number of
queried nodes and the number of storage nodes $n'$, i.e.,

$\eta = \frac{h}{n'}$. (13)

Definition 6: (Revealed Sensors Ratio) We define the ratio
of the number of sensor nodes $k'$ whose data is
retrieved by querying $h$ storage nodes to the total
number of sensor nodes $k$ as the revealed sensors ratio $\rho$:

$\rho = k'/k$. (14)



Definition 7: (Successful Decoding Probability) The successful
decoding probability $P_s$ is the probability that the $k$
source packets are all recovered from the $h$ queried storage
nodes.
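
Definitions 5 to 7 translate directly into small helper functions. The sketch below is illustrative: the function names are hypothetical, and the successful decoding probability is estimated empirically as the fraction of simulation trials in which all $k$ sensors are revealed, which is one natural way to measure it and not necessarily the procedure used for the reported results.

```python
def query_ratio(h, n_storage):
    """Definition 5: eta = h / n', with h queried nodes out of n' storage nodes."""
    return h / n_storage


def revealed_sensors_ratio(revealed_ids, k):
    """Definition 6: rho = k' / k, where k' is the number of sensors whose data was retrieved."""
    return len(revealed_ids) / k


def successful_decoding_probability(trials, k):
    """Empirical P_s: fraction of trials in which all k source packets were recovered.

    trials: list of sets, each holding the sensor ids revealed in one trial.
    """
    successes = sum(1 for revealed in trials if len(revealed) == k)
    return successes / len(trials)
```

For example, querying 30 of 200 storage nodes gives a query ratio of 0.15, and revealing 3 of 4 sensors gives a revealed sensors ratio of 0.75.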

The main metric that we investigate is the revealed sensors
ratio. It shows the amount of information that we are
able to obtain successfully using the proposed algorithm. We study
the relationship between the range of the storage nodes and
the revealed sensors ratio $\rho$. We first fix $\eta$ and change the ratio
between the range of the storage nodes and the region length
$L$.

Fig. 1. Network model representing a wireless sensor network with
sensing and storage nodes. The successful decoding probability increases with
increasing the total number of network nodes.

Fig. 1 shows that increasing the number of network nodes
while fixing the covering radius of each node improves
the successful decoding probability as well.
In particular, for $n > 500$ and $n' > 100$, we see that querying
20% to 30% of the storage nodes reveals the sensed data of all $k$
sensor nodes. Fig. 1 also shows that the revealed sensors ratio
is approximately one when $\eta$ is greater
than 0.17, and that a larger number of sensors gives
a larger $\rho$.

In Fig. 2, we show the effect of increasing the percentage 
of queried nodes on the successful decoding probability when 
each storage node has 40 buffers and a radio range of 2
distance units, in a square terrain of side length
L = 100 distance units. The percentage of storage nodes 
is always 20% of the total number of nodes. Increasing 
the number of nodes has a positive effect on the successful 
decoding probability. When there are 250 total nodes, the 
nodes are more dispersed and with this small radio range, the 
storage nodes cannot reach all the sensor nodes and thus we 
are not able to decode more than 60% of the sensors' data. 
This can be improved if the radio range is increased, thereby 
allowing the storage nodes to contact more sensors. 

Fig. 3 shows the effect of increasing the radio range with 
respect to the terrain side length L when the buffer size can 
hold 50 sensor messages per storage node and 30% of the 
storage nodes are queried. As we increase the radio range, 
the number of encoded messages is increased. This makes 
decoding a much harder task until, at some point, no messages 
are decoded. When the number of nodes in the terrain is 
limited, 250 for example, the increase in the radio range results 
in an increase in the contacted nodes. This goes on until the 
radio range covers an area with a radius of almost 20% of
the side length of the terrain area. Increasing the radio range
further results in encoding more nodes that we are not able
to decode and thus in a gradual decrease of the successful
decoding probability. We can deduce from the curve that
there is an optimal radio range for a network with a constant
buffer size and node distribution beyond which the successful
decoding probability decreases.

Fig. 2. Effect of changing the percentage of queried nodes when the number
of buffers and the radio range of all the nodes are changed. Clearly increasing
the number of nodes decreases the decoding performance due to lack of
resources.

The simulation results demonstrate that the proposed model 
is suitable for large-scale wireless sensor networks. Finding 
practical applications and network topologies in which this 
data collection algorithm can be deployed are directions for 
our future work. 

VI. Related Work 

In this section, we review some previous work in distributed 
data collection which is relevant to our work. 

• Dimakis et al. in [5] and [7] used a decentralized implementation
of fountain codes that uses geographic routing, in which
every node has to know its location. The motivation
for using fountain codes instead of random linear
codes is that the former require $O(k \log k)$ decoding
complexity, whereas the latter, such as RS codes, require $O(k^3)$
decoding complexity, where $k$ is the number of data
blocks to be encoded.

• Lin et al. in [11] and [12] studied the question "how
can we retrieve historical data that the sensors have gath- 
ered even if some sensors are destroyed or disappeared 
from the network?" They analyzed techniques to increase 
"persistence" of sensed data in a random wireless sensor 
network. They proposed two decentralized algorithms 
using fountain codes to guarantee the persistence and 
reliability of cached data on unreliable sensors. They used 
random walks to disseminate data from a sensor (source) 
node to a set of other storage nodes. The first algorithm 
introduces lower overhead than a naive random walk, while
the second algorithm has a lower level of fault tolerance
than the original centralized fountain code but incurs a
much lower dissemination cost.



[Fig. 3 plot. Axis label: Ratio of Radio Range to L. Curves: n = 250, 500,
750, 1000, 1250, 1500, and 1750, each with n' = 0.2, $\epsilon$ = 50, and
Queried = 30%.]


Fig. 3. The effect of increasing the radio range with respect to L. The 
maximum radio range for better decoding performance depends on the number 
of nodes in the system. 



• Kamara et al. in [9] proposed a novel technique called
growth codes to increase data persistence in wireless
sensor networks, i.e., to increase the amount of information
that can be recovered at the sink. Growth codes are a
linear technique in which information is encoded in an online
distributed way with increasing degree. They defined
persistence of a sensor network as "the fraction of data 
generated within the network that eventually reaches the 
sink" [9]. They showed that growth codes can increase 
the amount of information that can be recovered at any 
storage node at any time period. 

• Aly et al. in [1], [2] and [10] studied a model for
distributed network storage algorithms for wireless sensor
networks where $k$ sensor nodes (sources) want to disseminate
their data to $n$ storage nodes with low computational
complexity. The authors used fountain codes and random
walks on graphs to solve this problem. They also assumed
that the total numbers of sources and storage nodes are 
not known. 

VII. Conclusion

In this paper, we have studied the distributed storage prob- 
lem in large-scale random wireless sensor networks, in which 
there are sensing and storing nodes uniformly distributed in 
a region. We have proposed a data collection algorithm to 
precisely collect sensed data and successfully store it at storage 
nodes. The simulation results show that, with high probability, 
querying only 30% of the storage nodes with limited or 
unlimited buffers will retrieve all sensed data gathered by the 
sensing nodes. 